# Sandbox-RL: Scalable Multi-Model Optimization through Sandbox-Based Reinforcement Learning

<div align="center">

<img src="assets/logo.png" alt="Core SRL Logo" width="200">

**Scalable Multi-Model Optimization through Sandbox-Based Reinforcement Learning**

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

</div>

## What is Sandbox-RL?

Sandbox-RL enables **scalable multi-model optimization** through **sandbox-based reinforcement learning**. It trains 4-8 open-weight LLMs, such as Qwen3-14B, simultaneously under **cooperative-competitive dynamics**, applying weight updates while training is in progress.

### Key Features

- **Multi-Model Training**: Simultaneous RL training of 4-8 modern LLMs
- **Live Weight Updates**: Real-time parameter synchronization during training  
- **Cooperative-Competitive RL**: Novel algorithm balancing cooperation and competition
- **Modern Model Support**: Qwen3-14B, Llama-3.1, and other open-weight models
- **VERL/AReaL Integration**: Efficient training with KV-cache management and weight synchronization
- **Checkpoint Management**: Automatic saving and recovery
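
One way to picture the cooperative-competitive idea is as reward shaping across models. The sketch below is an illustration only, not the Sandbox-RL algorithm: `blend_rewards` and its `cooperation` weight are hypothetical names, and the blend (a shared team-mean term plus a relative-advantage term) is an assumed formulation.

```python
from statistics import mean
from typing import Dict


def blend_rewards(raw: Dict[str, float], cooperation: float = 0.5) -> Dict[str, float]:
    """Mix a shared (cooperative) and a relative (competitive) reward signal.

    Hypothetical helper for illustration; not part of core_srl.
    cooperation=1.0: every model receives the team mean.
    cooperation=0.0: every model receives its advantage over the team mean.
    """
    if not 0.0 <= cooperation <= 1.0:
        raise ValueError("cooperation must be in [0, 1]")
    team = mean(raw.values())
    return {
        name: cooperation * team + (1.0 - cooperation) * (reward - team)
        for name, reward in raw.items()
    }


# Three models with raw task rewards; an equal mix of both signals.
rewards = blend_rewards(
    {"qwen3": 1.0, "qwen_coder": 0.0, "llama3": 0.5},
    cooperation=0.5,
)
```

With a single scalar controlling the mix, moving `cooperation` toward 1.0 rewards the group as a whole, while moving it toward 0.0 rewards each model only for outperforming the others.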

## System Architecture

![System Architecture](assets/archi.jpeg)

```
┌─────────────────────────────────────────────────────────────────┐
│                    Sandbox-RL Architecture                      │
├─────────────────────────────────────────────────────────────────┤
│  Multi-Model Trainer                                           │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐│
│  │   Qwen3-14B │ │   Qwen-Math │ │ Qwen-Coder  │ │ Llama-3.1   ││
│  │   + LoRA    │ │   + LoRA    │ │   + LoRA    │ │   + LoRA    ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘│
│         │               │               │               │        │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │           Cooperative-Competitive RL Engine               │ │
│  │  • Weight Update Coordination  • Parameter Sharing        │ │
│  │  • VERL Integration           • AReaL Optimization        │ │
│  └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```

### Training Flow Architecture

```
                    ┌─────────────────────────────────────────┐
                    │         User Interface & API           │
                    └─────────────────┬───────────────────────┘
                                      │
                    ┌─────────────────▼───────────────────────┐
                    │         Task Scheduler                  │
                    │  ┌─────────┐ ┌─────────┐ ┌─────────┐   │
                    │  │  GPU 0  │ │  GPU 1  │ │  GPU 2  │   │
                    │  └─────────┘ └─────────┘ └─────────┘   │
                    └─────────────────┬───────────────────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         │                            │                            │
    ┌────▼────┐                  ┌────▼────┐                  ┌────▼────┐
    │ Model A │◄────────────────►│ Model B │◄────────────────►│ Model C │
    │ Training│  Cooperation     │ Training│  Competition     │ Training│
    │ Instance│                  │ Instance│                  │ Instance│
    └────┬────┘                  └────┬────┘                  └────┬────┘
         │                            │                            │
         └────────────────────────────┼────────────────────────────┘
                                      │
                    ┌─────────────────▼───────────────────────┐
                    │        System Resource Manager         │
                    │  ┌─────────────┐ ┌─────────────────────┐│
                    │  │ Memory Pool │ │ Network Bandwidth   ││
                    │  │ Management  │ │ & RDMA Controller   ││
                    │  └─────────────┘ └─────────────────────┘│
                    └─────────────────┬───────────────────────┘
                                      │
                    ┌─────────────────▼───────────────────────┐
                    │     VERL/AReaL Optimization Engine     │
                    │  ┌─────────────┐ ┌─────────────────────┐│
                    │  │ KV Cache    │ │ Weight Synchronizer ││
                    │  │ Manager     │ │ & Gradient Merger   ││
                    │  └─────────────┘ └─────────────────────┘│
                    └─────────────────┬───────────────────────┘
                                      │
                    ┌─────────────────▼───────────────────────┐
                    │       Checkpoint & Monitoring          │
                    │  ┌─────────┐ ┌─────────┐ ┌─────────┐   │
                    │  │ Save/   │ │ Metrics │ │ System  │   │
                    │  │ Load    │ │ Logger  │ │ Health  │   │
                    │  │ States  │ │         │ │ Monitor │   │
                    │  └─────────┘ └─────────┘ └─────────┘   │
                    └─────────────────────────────────────────┘
```

## Quick Start

### Installation

```bash
# Clone into ./core-srl so the directory matches the layout shown under Project Structure
git clone https://github.com/NoakLiu/SandBox-RL.git core-srl
cd core-srl
pip install -r requirements.txt
```

### Basic Training

```python
import asyncio
from core_srl import quick_start_multimodel_training

async def main():
    results = await quick_start_multimodel_training(
        num_models=4,
        max_episodes=100
    )
    print(f"Training completed: {results['status']}")

asyncio.run(main())
```

### Advanced Configuration

```python
import asyncio

from core_srl import MultiModelTrainer, MultiModelConfig, TrainingMode

config = MultiModelConfig(
    num_models=6,
    model_types=["qwen3", "qwen_coder", "llama3"],  # keys from Supported Models
    training_mode=TrainingMode.MIXED,
    max_episodes=1000,
    checkpoint_dir="./my_checkpoints"
)

trainer = MultiModelTrainer(config)
# trainer.train() is a coroutine, so run it on an event loop
results = asyncio.run(trainer.train())
```
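
The configuration above requests six models but names only three model types. How `MultiModelConfig` resolves that is not stated in this README; one plausible reading is a round-robin assignment, sketched below. `assign_model_types` is a hypothetical helper for illustration, not a `core_srl` function.

```python
from itertools import cycle, islice
from typing import List


def assign_model_types(num_models: int, model_types: List[str]) -> List[str]:
    """Assign a model type to each training slot by cycling through the list.

    Assumed behavior for illustration only; check the core_srl source
    for how num_models and model_types actually interact.
    """
    if num_models < 1:
        raise ValueError("num_models must be at least 1")
    if not model_types:
        raise ValueError("model_types must not be empty")
    return list(islice(cycle(model_types), num_models))


# Six slots over three types: each type is used twice.
slots = assign_model_types(6, ["qwen3", "qwen_coder", "llama3"])
```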

### Checkpoint Management

```python
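from core_srl import list_available_checkpoints

# List saved checkpoints
checkpoints = list_available_checkpoints()
print("Available:", checkpoints)

# Resume training from the first listed checkpoint, if any exist.
# `trainer` is the MultiModelTrainer created in Advanced Configuration.
if checkpoints:
    trainer.load_checkpoint(checkpoints[0])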
```
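
The ordering of the list returned by `list_available_checkpoints()` is not documented here, so `checkpoints[0]` may not be the most recent save. If checkpoints are stored as entries directly under `checkpoint_dir` (an assumption), the newest one can be picked by modification time. `latest_checkpoint` below is a hypothetical standalone helper using only the standard library.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the most recently modified entry in checkpoint_dir, or None.

    Hypothetical helper; assumes each checkpoint is a file or directory
    located directly under checkpoint_dir.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    # Skip hidden entries such as .gitkeep
    entries = [p for p in root.iterdir() if not p.name.startswith(".")]
    return max(entries, key=lambda p: p.stat().st_mtime, default=None)


# Returns None when the directory is missing or empty.
newest = latest_checkpoint("./my_checkpoints")
```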

## Supported Models

```python
MODERN_MODELS = {
    "qwen3": "Qwen/Qwen2.5-14B-Instruct",           # General-purpose Qwen (key is "qwen3"; ID is a Qwen2.5 checkpoint)
    "qwen_coder": "Qwen/Qwen2.5-Coder-14B-Instruct", # Code specialized
    "qwen_math": "Qwen/Qwen2.5-Math-14B-Instruct",   # Math specialized
    "llama3": "meta-llama/Llama-3.1-8B-Instruct"     # General-purpose Llama 3.1
}
```
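
The keys of this mapping are the strings passed as `model_types` in the configuration. A minimal sketch of resolving a key to its model ID with a clear error for typos follows; `resolve_model_id` is a hypothetical helper, and the dictionary is copied from the listing above so the block runs on its own.

```python
from typing import Dict

# Copied from the Supported Models listing above
MODERN_MODELS: Dict[str, str] = {
    "qwen3": "Qwen/Qwen2.5-14B-Instruct",
    "qwen_coder": "Qwen/Qwen2.5-Coder-14B-Instruct",
    "qwen_math": "Qwen/Qwen2.5-Math-14B-Instruct",
    "llama3": "meta-llama/Llama-3.1-8B-Instruct",
}


def resolve_model_id(key: str, registry: Dict[str, str] = MODERN_MODELS) -> str:
    """Map a short model-type key to its full model ID.

    Hypothetical helper for illustration; not part of core_srl.
    """
    try:
        return registry[key]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(
            f"Unknown model type {key!r}; expected one of: {known}"
        ) from None


model_id = resolve_model_id("llama3")
```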

## 📁 Project Structure

```
core-srl/
├── core_srl/           # Core framework (8 files)
├── examples/           # Training examples (8 examples)
├── tests/              # Test suites
├── docs/               # Documentation (6 docs)
├── data/               # Training data and results
└── checkpoints/        # Model checkpoints
```

## 📚 Documentation

- **[Quick Start](docs/quick_start.md)** - 5-minute setup
- **[Multi-Model Training](docs/multimodel_training.md)** - Training guide
- **[Model Configuration](docs/model_config.md)** - Modern LLM setup
- **[Checkpoints](docs/checkpoints.md)** - Save/restore training
- **[VERL/AReaL](docs/verl_areal.md)** - Advanced optimization
- **[API Reference](docs/api_reference.md)** - Complete API

## Contributing

Contributions are welcome, especially in these areas:
- New modern LLM integrations
- Advanced multi-model strategies
- Performance optimizations

## 📄 License

MIT License

---

<div align="center">
<b>Sandbox-RL v2.0.0 - Scalable Multi-Model Optimization Made Efficient</b>
</div>